Convergence of stochastic approximation

Aditya Mahajan

McGill University

Silviu I. Niculescu

University of Paris Saclay

30 Jul, 2025

Convergence of stochastic approximation

  • Asymptotic convergence of stochastic approximation \[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]

  • Find sufficient conditions under which \(θ_t \to θ^*\)

  • Since \(\{θ_t\}_{t \ge 0}\) is a stochastic process, we have different notions of convergence: almost sure, mean squared, in probability

    • Focus on almost sure (a.s.) convergence in this talk.
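The iteration can be simulated directly. The following minimal sketch (illustrative choices, not from the talk) runs the recursion with \(f(θ) = -0.5θ\), uniform noise, and \(α_t = 1/(t+1)\):

```python
import numpy as np

def stochastic_approximation(f, theta0, n_steps, alpha, rng):
    """Run theta_{t+1} = theta_t + alpha(t) * (f(theta_t) + xi_{t+1})."""
    theta = theta0
    for t in range(n_steps):
        xi = rng.uniform(-1.0, 1.0)        # martingale-difference noise
        theta = theta + alpha(t) * (f(theta) + xi)
    return theta

rng = np.random.default_rng(0)
theta_T = stochastic_approximation(
    f=lambda th: -0.5 * th,                # unique root theta* = 0
    theta0=5.0,
    n_steps=100_000,
    alpha=lambda t: 1.0 / (t + 1),         # Robbins-Monro step sizes
    rng=rng,
)
print(theta_T)                             # close to theta* = 0
```

With these step sizes the deterministic part decays like \(t^{-1/2}\) and the accumulated noise is damped, so the final iterate lands near \(θ^* = 0\).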

Hand-wavy convergence argument

Write SA iteration as \[ \frac{θ_{t+1} - θ_t}{α_t} = f(θ_t) + ξ_{t+1} \]

  • Iterates \(\{θ_t\}_{t \ge 0}\) may be viewed as noisy Euler discretization of the ODE \[ \dot θ = f(θ) \]

  • So, as time goes to infinity (i.e., \(\sum α_t = ∞\)), the iterates \(\{θ_t\}_{t \ge 0}\) should converge to the equilibrium point of the ODE

  • Provided the noise is not too large (which roughly translates to \(\sum α_t^2 < ∞\))

A brief (and incomplete) historical perspective

Let’s start with some examples

Consider a few examples of \(f(θ)\) where \(ξ_{t+1} \sim \text{Unif}[-1,1]\) and \(α_t = C/(t+1)\).

  1. \(f(θ) = - 0.5 θ\)
  2. \(f(θ) = - θ^3/(1 + θ^2)\)
  3. \(f(θ) = - θ \sqrt{\ABS{θ}}\)
  4. \(f(θ) = - θ^3\)

In all cases, the ODE \(\dot θ = f(θ)\) is globally asymptotically stable!
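One quick way to see this (a standard check, sketched here): with the candidate Lyapunov function \(V(θ) = θ^2/2\), we get \(\dot V = θ f(θ) < 0\) for every \(θ \neq 0\) in all four cases:

```latex
\dot V = \theta f(\theta) =
\begin{cases}
-0.5\,\theta^2 & \text{(Example 1)} \\
-\theta^4/(1+\theta^2) & \text{(Example 2)} \\
-\theta^2 \sqrt{|\theta|} & \text{(Example 3)} \\
-\theta^4 & \text{(Example 4)}
\end{cases}
\qquad < 0 \ \text{ for } \theta \neq 0 .
```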

Example 1: \(f(θ) = -0.5 θ\)

Example 2: \(f(θ) = -θ^3/(1+θ^2)\)

Example 3: \(f(θ) = -θ\sqrt{\ABS{θ}}\)

Example 4: \(f(θ) = -θ^3\)

Objective of the tutorial

  • Explain sufficient conditions for convergence of stochastic approximation

  • Try to understand what is happening in the examples

  • Focus on almost sure convergence. Other forms of convergence have different sufficient conditions

  • Do not present a detailed literature review but only a few key references

Modeling assumptions (1/3)

Stochastic approximation. Given a filtration \(\{\ALPHABET F_t\}_{t \ge 0}\) and adapted sequences \(\{α_t\}_{t \ge 0}\) and \(\{ξ_t\}_{t \ge 1}\), consider the iteration:

\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]

Assumptions on the function

F1. There is a unique \(θ^* \in \reals^d\) such that \(f(θ^*) = 0\).

F2. The function \(f\) is globally Lipschitz continuous with constant \(L\). \[ \NORM{f(θ_1) - f(θ_2)}_2 \le L \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]

Modeling assumptions (2/3)

\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]

Assumptions on the noise

N1. Martingale difference noise. The noise is a martingale difference sequence, i.e., \[ \EXP_t[ ξ_{t+1} ] = 0, \hskip 0.5em a.s. \]

N2. Controlled growth of variance. There exists a \(σ^2\) such that \[ \EXP_t[ \NORM{ξ_{t+1}}^2_2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2), \hskip 0.5em a.s. \]

Modeling assumptions (3/3)

Assumptions on learning rates

R1. \(\sum_{t \ge 0} α_t^2 < ∞\)

R2. \(\sum_{t \ge 0} α_t = ∞\)

Robbins-Monro conditions
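The step sizes \(α_t = C/(t+1)\) used in the earlier examples satisfy both conditions: the harmonic series diverges, while the squared series sums by the Basel identity:

```latex
\sum_{t \ge 0} \frac{C}{t+1} = \infty,
\qquad
\sum_{t \ge 0} \frac{C^2}{(t+1)^2} = \frac{C^2 \pi^2}{6} < \infty .
```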

Basic convergence result (ODE approach)

V0. \(θ^*\) is the globally asymptotically stable equilibrium point of the ODE \(\dot θ = f(θ)\).

Assume that the modeling assumptions hold. Then (V0) implies that \(θ_t \to θ^*\) a.s. on the set \(\{ ω : \sup_{t \ge 0} \NORM{θ_t(ω)}_2 < ∞ \}\)

  • The result does not imply boundedness of iterates!

  • It only shows that along the sample paths where iterates remain bounded, they converge to the right limit.

How to ensure boundedness of iterates

  • (Borkar and Meyn 2000) fluid limit model of ODE and global asymptotic stability of limiting ODE

  • (Vidyasagar 2023) Sufficient conditions based on the properties of the Lyapunov function

Fluid limit model of Borkar and Meyn (2000)

  • For any \(r > 0\), define \(f_r(θ) = f(r θ)/r\).

BM. There exists a function \(f_{∞} \colon \reals^d \to \reals^d\) such that \[ \lim_{r \to ∞} \frac{f(r θ)}{r} = f_{∞}(θ), \quad \forall θ \in \reals^d \] and the origin is the globally asymptotically stable equilibrium of the fluid limit ODE \(\dot θ = f_{∞}(θ)\).

Assume that the modeling assumptions hold. Then

  • (BM) implies that \(\{θ_t\}_{t \ge 0}\) are bounded almost surely.
  • In addition, (V0) implies \(θ_t \to θ^*\) almost surely.
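A quick numerical sanity check (a sketch with illustrative values): the rescaled functions \(f(rθ)/r\) stabilize as \(r\) grows for Example 1 but blow up for Example 4, so the fluid limit \(f_∞\) exists only in the first case:

```python
def f1(x):          # Example 1: f(theta) = -0.5*theta, linear growth
    return -0.5 * x

def f4(x):          # Example 4: f(theta) = -theta^3, cubic growth
    return -x**3

theta = 2.0
for r in (1e1, 1e3, 1e5):
    # f1(r*theta)/r stays at -0.5*theta; f4(r*theta)/r diverges to -infinity
    print(f"r={r:8.0e}  f1={f1(r * theta) / r:+.3f}  f4={f4(r * theta) / r:+.3e}")
```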

BM requires linear growth

| \(f(θ)\)            | (BM) | Converges a.s. |
|---------------------|------|----------------|
| \(-0.5 θ\)          | Yes  | Yes            |
| \(-θ^3/(1 + θ^2)\)  | No   | Yes            |
| \(-θ\sqrt{\ABS{θ}}\)| No   | No             |
| \(-θ^3\)            | No   | No             |

Lyapunov conditions of Vidyasagar (2023)

There exists a twice-differentiable Lyapunov fn \(V \colon \reals^d \to \reals\) that satisfies

V1. Squared norm equivalent. There exist \(a,b > 0\) such that \[ a \NORM{θ - θ^*}_2^2 \le V(θ) \le b \NORM{θ - θ^*}_2^2, \quad \forall θ \in \reals^d. \]

V2. Smoothness. There exists \(M > 0\) such that \[ \NORM{\GRAD V(θ_1) - \GRAD V(θ_2) }_2 \le 2M \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]

V3. \(\dot V(θ) \coloneq \IP{ \GRAD V(θ)}{f(θ)} \le 0\) for all \(θ \in \reals^d\).

V4. There exists a function \(φ\) in class \(\mathcal{B}\) (i.e., \(φ(0) = 0\) and for all \(0 < ε < M < ∞\), \(\inf_{r \in (ε, M)} φ(r) > 0\)) such that \[ \dot V(θ) \le - φ(\NORM{θ - θ^*}_2), \quad \forall θ \in \reals^d. \]
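As a concrete instance (a routine verification, sketched here): for Example 1, \(f(θ) = -0.5θ\) with \(θ^* = 0\), the choice \(V(θ) = θ^2\) satisfies all four conditions:

```latex
\underbrace{\theta^2 \le V(\theta) \le \theta^2}_{\text{(V1) with } a = b = 1},
\qquad
\underbrace{\bigl| \nabla V(\theta_1) - \nabla V(\theta_2) \bigr|
  = 2 \bigl| \theta_1 - \theta_2 \bigr|}_{\text{(V2) with } M = 1},
\qquad
\dot V(\theta) = 2\theta \cdot (-0.5\,\theta) = -\theta^2 ,
```

so (V3) holds, and (V4) holds with \(φ(r) = r^2\), which is in class \(\mathcal{B}\).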

Lyapunov conditions of Vidyasagar (2023)

Assume that the modeling assumptions and (V1)–(V2) hold.

  • Then (V3) and (R1) imply that the iterates \(\{θ_t\}_{t \ge 1}\) are bounded a.s.

  • In addition, if (V4) and (R2) hold, then \(θ_t \to θ^*\) a.s.

  • (V1)–(V3) imply local asymptotic stability. Adding (V4) implies global asymptotic stability.

  • For boundedness of iterates, we only need local asymptotic stability (and it does not require \(θ^*\) to be the unique root of \(f\))

  • But for almost sure convergence, we need global asymptotic stability.

Let’s check the examples

| \(f(θ)\)            | (BM) | (V1)–(V4) | Converges a.s. |
|---------------------|------|-----------|----------------|
| \(-0.5 θ\)          | Yes  | Yes       | Yes            |
| \(-θ^3/(1 + θ^2)\)  | No   | Yes       | Yes            |
| \(-θ\sqrt{\ABS{θ}}\)| No   | Yes       | No             |
| \(-θ^3\)            | No   | Yes       | No             |

  • Both Borkar and Meyn (2000) and Vidyasagar (2023) provide sufficient conditions that assume \(f\) is globally Lipschitz

  • So, they do not apply when \(f\) grows super-linearly.

Conclusion

  • Simple illustration of almost sure convergence of stochastic approximation

  • Variations encountered in applications (especially RL)

    • Biased noise: Similar analysis goes through provided the iterates remain bounded and the cumulative bias is summable, i.e., \(\sum_{t \ge 0} α_t \NORM{\EXP_t[ ξ_{t+1} ]}_2 < ∞.\)

    • Markov noise: \(ξ_{t+1} = F(θ_t, w_t) - \EXP_t[F(θ_t, w_t)]\), where \(\{w_t\}_{t \ge 0}\) is a Markov process that is controlled by \(θ_t\).

    • Asynchronous updates: Only one component of \(θ_t\) is updated at each time.

    • Multi-time scale updates: Multiple coupled SA updates, operating at different timescales.

  • Quite a rich area. Lot of interest in relaxing the regularity assumptions on \(f\) and \(V\), depending on specific application.

References

Benaïm, M. 1996. A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization 34, 2, 437–472. DOI: 10.1137/s0363012993253534.
Benaïm, M. and Hirsch, M.W. 1999. Stochastic approximation algorithms with constant step size whose average is cooperative. The Annals of Applied Probability 9, 1. DOI: 10.1214/aoap/1029962603.
Benveniste, A., Métivier, M., and Priouret, P. 1990. Adaptive algorithms and stochastic approximations. Springer Berlin Heidelberg. DOI: 10.1007/978-3-642-75894-2.
Borkar, V.S. 2008. Stochastic approximation. Hindustan Book Agency. DOI: 10.1007/978-93-86279-38-5.
Borkar, V.S. and Meyn, S.P. 2000. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 2, 447–469. DOI: 10.1137/s0363012997331639.
Kushner, H.J. and Clark, D.S. 1978. Stochastic approximation methods for constrained and unconstrained systems. Springer New York. DOI: 10.1007/978-1-4684-9352-8.
Kushner, H.J. and Yin, G.G. 2003. Stochastic approximation and recursive algorithms and applications. Springer-Verlag. DOI: 10.1007/b97441.
Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control 22, 4, 551–575. DOI: 10.1109/tac.1977.1101561.
Prashanth, L.A. and Bhatnagar, S. 2025. Gradient-based algorithms for zeroth-order optimization. Foundations and Trends® in Optimization 8, 1–3, 1–332. DOI: 10.1561/2400000047.
Robbins, H. and Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3, 400–407. DOI: 10.1214/aoms/1177729586.
Vidyasagar, M. 2023. Convergence of stochastic approximation via martingale and converse Lyapunov methods. Mathematics of Control, Signals, and Systems 35, 2, 351–374. DOI: 10.1007/s00498-023-00342-9.